We have existing observations
\[(x_1, C_1), \dots, (x_n, C_n)\]
where the \(C_i\) are categories.
Given a new observation \(x_{new}\), how do we predict \(C_{new}\)?
LDA with one predictor: come up with a “cutoff”. If \(x_{new} >\) the cutoff, predict class A; otherwise, predict class B.
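A minimal base-R sketch of the one-predictor cutoff rule. The class data and the midpoint-of-means cutoff here are illustrative assumptions, not values from the slides:

```r
# Hypothetical observations from each class
x_a <- c(9, 10, 11, 12)   # class A
x_b <- c(1, 2, 3, 4)      # class B

# A natural cutoff: the midpoint of the two class means
cutoff <- (mean(x_a) + mean(x_b)) / 2   # (10.5 + 2.5) / 2 = 6.5

# Classify a new observation: above the cutoff -> A, otherwise -> B
predict_class <- function(x) ifelse(x > cutoff, "A", "B")
predict_class(c(8, 5))   # "A" "B"
```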
Confusion matrix (true class `Class` vs. predicted class `pred_class`):

# A tibble: 4 × 3
  Class pred_class     n
  <chr> <chr>      <int>
1 A     A            839
2 A     B            161
3 B     A            166
4 B     B            834
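From these counts we can compute the overall accuracy in base R (the four counts are taken directly from the table above):

```r
# Overall accuracy: correctly classified / total observations
correct <- 839 + 834                # A predicted A, plus B predicted B
total   <- 839 + 161 + 166 + 834    # all 2000 observations
accuracy <- correct / total
accuracy                            # 0.8365
```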
In what scenario would we choose an “uneven” cutoff?
To perform classification with Linear Discriminant Analysis, we choose the best dividing line between the two classes.
The Big Questions
What is our definition of best?
What if we allow the line to “wiggle”?
Let’s keep hanging out with the insurance dataset.
Suppose we want to use information about insurance charges to predict whether someone is a smoker or not.
Quick Quiz
What do we have to change?
The model?
The recipe?
The workflow?
The fit?
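One way the quiz answers could play out: in tidymodels, only the model specification changes (to `discrim_linear()` from the discrim package), while the recipe, workflow, and fit follow the same pattern as before. A sketch using simulated stand-in data, since the real insurance data isn't reproduced here:

```r
library(tidymodels)
library(discrim)   # provides discrim_linear()

# Simulated stand-in for the insurance data (hypothetical values)
set.seed(1)
insurance <- tibble(
  smoker  = factor(rep(c("no", "yes"), times = c(80, 20))),
  charges = c(rnorm(80, 7500, 2000), rnorm(20, 31000, 5000))
)

lda_spec <- discrim_linear() |> set_engine("MASS")   # the model changes
rec <- recipe(smoker ~ charges, data = insurance)    # new outcome and predictor
wf  <- workflow() |> add_model(lda_spec) |> add_recipe(rec)
wf_fit <- fit(wf, data = insurance)                  # fit as usual
```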
What if we want to use more than one predictor?
Adding age as a predictor:

parsnip model object
Call:
lda(smoker ~ charges + age, data = data)
Prior probabilities of groups:
no yes
0.7981 0.2019
Group means:
charges age
no 7528 38.30
yes 31152 36.62
Coefficients of linear discriminants:
LD1
charges 0.0001718
age -0.0449953
\[\text{Score} = 0.0001718 \times \text{charges} - 0.0450 \times \text{age}\]
Predict “smoker” if Score > 0.
The boundary between the two predictions is where the score equals zero:
\[0 = 0.0001718 \times \text{charges} - 0.0450 \times \text{age}\]
\[\text{age} = \frac{0.0001718}{0.0450} \times \text{charges}\]
\[\text{age} = 0.00382 \times \text{charges}\]
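We can check the boundary slope numerically from the coefficients reported in the lda output above:

```r
# Coefficients of the linear discriminant (from the model output)
coef_charges <- 0.0001718
coef_age     <- 0.0449953   # magnitude of the age coefficient

# Setting the score to zero and solving for age gives the boundary slope
slope <- coef_charges / coef_age
round(slope, 6)             # 0.003818
```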
Open Activity-Classification-LDA.qmd
Select the best LDA model for predicting smoker status.
Compare the accuracy to your KNN and Logistic Regression models (from last class).
One more time: wiggly style
What if we allow the separating line to be non-linear?
In Quadratic Discriminant Analysis (QDA), we allow the data in the different categories to have different variances, so the dividing line becomes a curve.
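A quick sketch of QDA with `MASS::qda()`. The data here are simulated to have unequal class variances (all values are made up for illustration; the activity uses the insurance data):

```r
library(MASS)   # provides qda()

# Two classes with very different spreads (hypothetical values)
set.seed(42)
df <- data.frame(
  class = factor(rep(c("A", "B"), each = 100)),
  x = c(rnorm(100, mean = 0, sd = 1),   # class A: tight
        rnorm(100, mean = 3, sd = 3))   # class B: spread out
)

# QDA estimates a separate variance for each class,
# so the decision boundary is quadratic rather than linear
fit <- qda(class ~ x, data = df)
predict(fit, data.frame(x = c(0, 6)))$class
```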
Open Activity-Classification-LDA.qmd again
Select the best QDA model
Compare to prior models